Red Wine Exploration by Paulo Casaretto

Univariate Plots Section

[1] “Dataset variables”

## 'data.frame':    1599 obs. of  14 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
##  $ rating              : Ord.factor w/ 3 levels "bad"<"average"<..: 2 2 2 2 2 2 2 3 3 2 ...
## [1] "Dataset structure"
Dataset summary (continued below)
X fixed.acidity volatile.acidity citric.acid
Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
Table continues below
residual.sugar chlorides free.sulfur.dioxide
Min. : 0.900 Min. :0.01200 Min. : 1.00
1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
Median : 2.200 Median :0.07900 Median :14.00
Mean : 2.539 Mean :0.08747 Mean :15.87
3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
Max. :15.500 Max. :0.61100 Max. :72.00
Table continues below
total.sulfur.dioxide density pH sulphates
Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
alcohol quality rating
Min. : 8.40 3: 10 bad : 63
1st Qu.: 9.50 4: 53 average:1319
Median :10.20 5:681 good : 217
Mean :10.42 6:638 NA
3rd Qu.:11.10 7:199 NA
Max. :14.90 8: 18 NA

First I’m going to explore each individual distribution to get a feel for the data. This will also help me choose the kind of assumptions I can make when applying statistical tests.

The high concentration of wines in the center region of quality and the lack of outliers might be a problem for building a predictive model later on.

There is a high concentration of wines with fixed.acidity close to 8 (the median is 7.9), but there are also some outliers that pull the mean up to 8.32.

The distribution of volatile.acidity appears bimodal, with peaks at 0.4 and 0.6 and some outliers in the higher ranges.

Now this is a strange distribution. About 8% of the wines contain no citric acid at all. Maybe a problem in the data collection process?
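The share of wines without citric acid can be checked directly. This is a minimal sketch, not the code behind the report: the vector below holds made-up stand-in values, and in the real analysis the column would presumably be `wine$citric.acid` (the data frame name `wine` is taken from the model output further down). The figure is consistent with the pH model later in the report, which is fitted on the 1467 wines with citric.acid > 0 (1463 residual degrees of freedom plus 4 coefficients), leaving 132 of 1599 wines at zero.

```r
# Stand-in data (hypothetical values); the real column would be wine$citric.acid
citric_acid <- c(0, 0.04, 0.56, 0, 0.06, 0.02, 0.36, 0.26, 0.49, 0.31)

# mean() of a logical vector gives the proportion of TRUE values
share_zero <- mean(citric_acid == 0)
share_zero
```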

There is a high concentration of wines with residual.sugar around 2.2 (the median), with some outliers along the higher ranges.

We see a similar distribution with chlorides.

The distribution of free.sulfur.dioxide peaks at around 7 and from then on resembles a long-tailed distribution, with very few wines over 60.

As expected, the distribution of total.sulfur.dioxide closely resembles the last one.

The distribution of density looks very close to normal.

pH also looks normally distributed.

For sulphates we see a distribution similar to the ones of residual.sugar and chlorides.

For alcohol we see the same rapid increase followed by a long tail that we saw in the sulfur dioxide variables. I wonder if there is a correlation between the variables.

Univariate Analysis

What is the structure of your dataset?

There are 1599 observations of wines in the dataset, with 12 features (not counting the row index X and the derived rating variable). One variable is categorical (quality) and the others are numerical variables that describe physical and chemical properties of the wine.

Other observations: The median quality is 6, which on the given scale (1-10) is a mediocre wine. The best wine in the sample has a score of 8 and the worst has a score of 3. The dataset is not balanced, that is, there are many more average wines than poor or excellent ones, and this might prove challenging when designing a predictive algorithm.

What is/are the main feature(s) of interest in your dataset?

The main feature in the data is quality. I’d like to determine which features determine the quality of wines.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The variables related to acidity (fixed.acidity, volatile.acidity, citric.acid and pH) might explain some of the variance. I suspect the different acid concentrations might alter the taste of the wine. Also, residual.sugar dictates how sweet a wine is and might also have an influence on taste.

Did you create any new variables from existing variables in the dataset?

I created a rating variable to improve the later visualizations.
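A rating variable of this kind can be derived with `cut()`. The sketch below is an assumption about how it might have been done, not the author's code: the cut points are inferred from the summary counts (bad = 63 matches qualities 3 and 4, average = 1319 matches 5 and 6, good = 217 matches 7 and 8), and the quality scores used here are made up.

```r
# Hypothetical quality scores; the real column is wine$quality
quality <- c(3, 4, 5, 5, 6, 6, 7, 8)

# Cut points are an assumption inferred from the summary counts:
# (-Inf, 4] -> bad, (4, 6] -> average, (6, Inf] -> good
rating <- cut(quality,
              breaks = c(-Inf, 4, 6, Inf),
              labels = c("bad", "average", "good"),
              ordered_result = TRUE)
table(rating)
```

Using `ordered_result = TRUE` keeps the levels ordered as bad < average < good, which matches the `Ord.factor` shown in the dataset structure.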

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Citric.acid stood out from the other distributions. Apart from some outliers, it had a roughly rectangular distribution, which seems very unexpected given the wine quality distribution.

Bivariate Plots Section

A correlation table for all variables will help understand the relationships between them.

Table continues below
  fixed.acidity volatile.acidity
fixed.acidity 1 -0.2561
volatile.acidity -0.2561 1
citric.acid 0.6717 -0.5525
residual.sugar 0.1148 0.001918
chlorides 0.09371 0.0613
free.sulfur.dioxide -0.1538 -0.0105
total.sulfur.dioxide -0.1132 0.07647
density 0.668 0.02203
pH -0.683 0.2349
sulphates 0.183 -0.261
alcohol -0.06167 -0.2023
quality 0.1241 -0.3906
Table continues below
  citric.acid residual.sugar
fixed.acidity 0.6717 0.1148
volatile.acidity -0.5525 0.001918
citric.acid 1 0.1436
residual.sugar 0.1436 1
chlorides 0.2038 0.05561
free.sulfur.dioxide -0.06098 0.187
total.sulfur.dioxide 0.03553 0.203
density 0.3649 0.3553
pH -0.5419 -0.08565
sulphates 0.3128 0.005527
alcohol 0.1099 0.04208
quality 0.2264 0.01373
Table continues below
  chlorides free.sulfur.dioxide
fixed.acidity 0.09371 -0.1538
volatile.acidity 0.0613 -0.0105
citric.acid 0.2038 -0.06098
residual.sugar 0.05561 0.187
chlorides 1 0.005562
free.sulfur.dioxide 0.005562 1
total.sulfur.dioxide 0.0474 0.6677
density 0.2006 -0.02195
pH -0.265 0.07038
sulphates 0.3713 0.05166
alcohol -0.2211 -0.06941
quality -0.1289 -0.05066
Table continues below
  total.sulfur.dioxide density
fixed.acidity -0.1132 0.668
volatile.acidity 0.07647 0.02203
citric.acid 0.03553 0.3649
residual.sugar 0.203 0.3553
chlorides 0.0474 0.2006
free.sulfur.dioxide 0.6677 -0.02195
total.sulfur.dioxide 1 0.07127
density 0.07127 1
pH -0.06649 -0.3417
sulphates 0.04295 0.1485
alcohol -0.2057 -0.4962
quality -0.1851 -0.1749
Table continues below
  pH sulphates alcohol
fixed.acidity -0.683 0.183 -0.06167
volatile.acidity 0.2349 -0.261 -0.2023
citric.acid -0.5419 0.3128 0.1099
residual.sugar -0.08565 0.005527 0.04208
chlorides -0.265 0.3713 -0.2211
free.sulfur.dioxide 0.07038 0.05166 -0.06941
total.sulfur.dioxide -0.06649 0.04295 -0.2057
density -0.3417 0.1485 -0.4962
pH 1 -0.1966 0.2056
sulphates -0.1966 1 0.09359
alcohol 0.2056 0.09359 1
quality -0.05773 0.2514 0.4762
  quality
fixed.acidity 0.1241
volatile.acidity -0.3906
citric.acid 0.2264
residual.sugar 0.01373
chlorides -0.1289
free.sulfur.dioxide -0.05066
total.sulfur.dioxide -0.1851
density -0.1749
pH -0.05773
sulphates 0.2514
alcohol 0.4762
quality 1
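A table like the one above can be computed with `cor()` once every column is numeric. The following is a sketch on made-up data (the data frame `toy` and its values are hypothetical, and how the report itself converted quality is not shown in the source). The one step that needs care is converting the ordered quality factor back to its scores.

```r
set.seed(1)  # reproducible made-up data, not the wine dataset
n <- 200
alcohol <- rnorm(n, mean = 10.4, sd = 1)
density <- 1.01 - 0.0013 * alcohol + rnorm(n, sd = 0.001)
toy <- data.frame(
  alcohol = alcohol,
  density = density,
  quality = factor(sample(3:8, n, replace = TRUE), ordered = TRUE)
)

# Convert the ordered factor via character so that "3".."8" become 3..8
toy$quality <- as.numeric(as.character(toy$quality))

# Pearson correlations, rounded to four significant digits
cor_table <- signif(cor(toy), 4)
cor_table
```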

Alcohol has a negative correlation with density. This is expected, as alcohol is less dense than water.

Volatile.acidity has a positive correlation with pH. This is unexpected as pH is a direct measure of acidity. Maybe the effect of a lurking variable?

Residual.sugar does not show correlation with quality. Free.sulfur.dioxide and total.sulfur.dioxide are highly correlated as expected.

Density has a strong correlation with fixed.acidity (0.668). The variables with the strongest correlations to quality are alcohol (0.4762) and volatile.acidity (-0.3906).

Let’s use boxplots to further examine the relationship between some variables and quality.

Summaries for fixed.acidity grouped by quality
quality mean median
3 8.36 7.5
4 7.779 7.5
5 8.167 7.8
6 8.347 7.9
7 8.872 8.8
8 8.567 8.25

As the correlation table showed, fixed.acidity seems to have little to no effect on quality.
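The per-quality summaries in this section follow one pattern: group a variable by quality, then take the mean and median. A minimal base R sketch of that pattern, using a few made-up rows (the data frame `toy` is hypothetical and the report's own summarising code is not shown in the source):

```r
# Made-up rows for illustration; the real data frame is assumed to be `wine`
toy <- data.frame(
  quality = c(5, 5, 5, 6, 6, 7),
  fixed.acidity = c(7.4, 7.8, 8.2, 7.9, 8.5, 8.8)
)

# tapply() applies a function to fixed.acidity within each quality level
means   <- tapply(toy$fixed.acidity, toy$quality, mean)
medians <- tapply(toy$fixed.acidity, toy$quality, median)

by_quality <- data.frame(
  quality = names(means),
  mean    = as.vector(means),
  median  = as.vector(medians)
)
by_quality
```

Comparing mean and median per group is a quick way to spot skew: where the mean sits well above the median, as for quality 3 in the fixed.acidity table, a few high values are pulling it up.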

Summaries for volatile.acidity grouped by quality
quality mean median
3 0.8845 0.845
4 0.694 0.67
5 0.577 0.58
6 0.4975 0.49
7 0.4039 0.37
8 0.4233 0.37

volatile.acidity seems to be an unwanted feature in wines. Quality seems to go up when volatile.acidity goes down. The higher ranges seem to produce more average and poor wines.

Summaries for citric.acid grouped by quality
quality mean median
3 0.171 0.035
4 0.1742 0.09
5 0.2437 0.23
6 0.2738 0.26
7 0.3752 0.4
8 0.3911 0.42

We can see the weak correlation between these two variables. Better wines tend to have a higher concentration of citric acid.

Summaries for residual.sugar grouped by quality
quality mean median
3 2.635 2.1
4 2.694 2.1
5 2.529 2.2
6 2.477 2.2
7 2.721 2.3
8 2.578 2.1

Contrary to what I initially expected, residual.sugar seems to have little to no effect on perceived quality.

Summaries for chlorides grouped by quality
quality mean median
3 0.1225 0.0905
4 0.09068 0.08
5 0.09274 0.081
6 0.08496 0.078
7 0.07659 0.073
8 0.06844 0.0705

Although the correlation is weak, a lower concentration of chlorides seems to produce better wines.

Summaries for free.sulfur.dioxide grouped by quality
quality mean median
3 11 6
4 12.26 11
5 16.98 15
6 15.71 14
7 14.05 11
8 13.28 7.5

The ranges are really close to each other, but it seems that with too little sulfur dioxide we get a poor wine and with too much we get an average wine.

Summaries for total.sulfur.dioxide grouped by quality
quality mean median
3 24.9 15
4 36.25 26
5 56.51 47
6 40.87 35
7 35.02 27
8 33.44 21.5

Since total.sulfur.dioxide is a superset of free.sulfur.dioxide, it is no surprise to find a very similar distribution here.

Summaries for density grouped by quality
quality mean median
3 0.9975 0.9976
4 0.9965 0.9965
5 0.9971 0.997
6 0.9966 0.9966
7 0.9961 0.9958
8 0.9952 0.9949

Better wines tend to have lower densities, but this is probably due to the alcohol concentration. I wonder if density still has an effect if we hold alcohol constant.

Summaries for pH grouped by quality
quality mean median
3 3.398 3.39
4 3.382 3.37
5 3.305 3.3
6 3.318 3.32
7 3.291 3.28
8 3.267 3.23

Although there is definitely a trend (better wines being more acidic), there are some outliers. I wonder how the distribution of the different acids affects this.

Let’s examine how each acid concentration affects pH.

It is really strange that an acid concentration would have a positive correlation with pH. Maybe Simpson’s paradox?

When we cluster the data and recalculate the regression coefficients, there is a change in sign. This indicates that there is in fact a lurking variable distorting the overall coefficient, which is the signature of Simpson’s paradox.
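The sign change described above can be reproduced on synthetic data. The sketch below is only an illustration of the effect, not the analysis run on the wine data (the Reflection section mentions the “Simpsons” package for that). The groups, slopes and noise levels are all made up: each group has a negative slope of y on x, but the group centres rise together, so the pooled slope comes out positive.

```r
set.seed(42)  # made-up data, not the wine dataset
group <- rep(c("a", "b", "c"), each = 50)
centre <- unname(c(a = 1, b = 3, c = 5)[group])  # group centres shift x and y together
x <- centre + rnorm(150, sd = 0.5)
y <- 2 * centre - 1 * (x - centre) + rnorm(150, sd = 0.2)  # within-group slope is -1

# Pooled regression ignores the groups
pooled_slope <- coef(lm(y ~ x))[["x"]]

# Separate regression inside each group
within_slopes <- sapply(split(data.frame(x, y), group),
                        function(d) coef(lm(y ~ x, data = d))[["x"]])

pooled_slope    # positive
within_slopes   # negative in every group
```

Here the grouping variable plays the role of the lurking variable: once it is accounted for, the direction of the relationship reverses.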

Because pH measures acid concentration on a log scale, it is no surprise to find stronger correlations between pH and the log of the acid concentrations. We can investigate how much of the variance in pH these three acidity variables can explain using a linear model.

## 
## Call:
## lm(formula = pH ~ I(log10(citric.acid)) + I(log10(volatile.acidity)) + 
##     I(log10(fixed.acidity)), data = subset(wine, citric.acid > 
##     0))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.47184 -0.06318 -0.00003  0.06447  0.32265 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 4.230862   0.040578 104.266  < 2e-16 ***
## I(log10(citric.acid))      -0.052187   0.008797  -5.933 3.72e-09 ***
## I(log10(volatile.acidity)) -0.049788   0.021248  -2.343   0.0193 *  
## I(log10(fixed.acidity))    -1.071983   0.038987 -27.496  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1068 on 1463 degrees of freedom
## Multiple R-squared:  0.4876, Adjusted R-squared:  0.4866 
## F-statistic: 464.1 on 3 and 1463 DF,  p-value: < 2.2e-16

It seems the three acidity variables can explain only about half of the variance in pH (R-squared of 0.49). The mean error is especially bad for poor and for excellent wines. This leads me to believe that there are other components that affect acidity.
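The model call above is fitted on `subset(wine, citric.acid > 0)`. The likely reason is that the log of zero is not finite, so the wines with no citric acid have to be dropped before taking `log10()`. A small sketch with made-up concentrations:

```r
# Hypothetical concentrations, including a wine with no citric acid
citric_acid <- c(0, 0.04, 0.56)

log10(citric_acid)            # first element is -Inf, which lm() cannot use
keep <- citric_acid > 0       # same filter as subset(wine, citric.acid > 0)
log10(citric_acid[keep])      # finite values only
```

This also means the pH model describes only the wines that contain some citric acid, which is worth keeping in mind when reading its R-squared.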

Interesting. Although there are many outliers among the average wines, better wines seem to have a higher concentration of sulphates.

The correlation is clear here. As alcohol content increases, so does the concentration of better-graded wines. Given the high number of outliers, it seems we cannot rely on alcohol alone to produce better wines. Let’s try using a simple linear model to investigate.

## 
## Call:
## lm(formula = as.numeric(quality) ~ alcohol, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8442 -0.4112 -0.1690  0.5166  2.5888 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.12503    0.17471  -0.716    0.474    
## alcohol      0.36084    0.01668  21.639   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7104 on 1597 degrees of freedom
## Multiple R-squared:  0.2267, Adjusted R-squared:  0.2263 
## F-statistic: 468.3 on 1 and 1597 DF,  p-value: < 2.2e-16

Based on the R-squared value (0.2267), it seems alcohol alone explains only about 23% of the variance in quality. We’re going to need to look at the other variables to generate a better model.
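One detail worth keeping in mind when reading the coefficients above: quality is stored as an ordered factor with levels "3" to "8", and `as.numeric()` on a factor returns the level codes (1 to 6) rather than the scores themselves. Shifting the response by a constant changes only the intercept, so the slope and R-squared are unaffected, and this would explain why the intercept sits close to zero. The short sketch below, on made-up scores, shows the difference between the two conversions.

```r
# Ordered factor with the same six levels as quality in the str() output
quality <- factor(c(3, 5, 6, 8), levels = 3:8, ordered = TRUE)

as.numeric(quality)                # level codes: 1 3 4 6
as.numeric(as.character(quality))  # original scores: 3 5 6 8
```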

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Fixed.acidity seems to have little to no effect on quality.

Quality seems to go up when volatile.acidity goes down. The higher ranges seem to produce more average and poor wines.

Better wines tend to have a higher concentration of citric acid.

Contrary to what I initially expected, residual.sugar seems to have little to no effect on perceived quality.

Although the correlation is weak, a lower concentration of chlorides seems to produce better wines.

Better wines tend to have lower densities.

In terms of pH, it seems better wines are more acidic, but there were many outliers. Better wines also seem to have a higher concentration of sulphates.

Alcohol content has the strongest correlation with quality (0.4762), but as the linear model showed, it cannot explain all the variance alone. We’re going to need to look at the other variables to generate a better model.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I verified the strong relation between free and total sulfur.dioxide.

I also checked the relation between the acid concentration and pH. Of those, only volatile.acidity surprised me with a positive coefficient for the linear model.

What was the strongest relationship you found?

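By absolute value in the correlation table, the strongest relationship was between fixed.acidity and pH (-0.683), closely followed by fixed.acidity and citric.acid (0.6717), fixed.acidity and density (0.668), and total.sulfur.dioxide and free.sulfur.dioxide (0.6677).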

Multivariate Plots Section

Alcohol and other variables

Let’s try using multivariate plots to answer some questions that arose earlier and to look for other relationships in the data.

When we hold alcohol constant, there is no evidence that density affects quality which confirms our earlier suspicion.
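The report checks this graphically. A regression analogue of "holding alcohol constant" is to include alcohol as a covariate and look at what is left of the density coefficient. The sketch below uses made-up data in which density depends on alcohol and quality depends on alcohol only; all names and numbers are assumptions for illustration, not results from the wine dataset.

```r
set.seed(7)  # made-up data, not the wine dataset
n <- 500
alcohol <- rnorm(n, mean = 10.4, sd = 1)
density <- 1.01 - 0.0013 * alcohol + rnorm(n, sd = 0.001)  # density driven by alcohol
quality <- 2 + 0.36 * alcohol + rnorm(n, sd = 0.7)          # quality driven by alcohol only

simple   <- lm(quality ~ density)            # density looks influential on its own
adjusted <- lm(quality ~ alcohol + density)  # density effect with alcohol held fixed

coef(summary(simple))["density", ]
coef(summary(adjusted))["density", ]
```

In data built this way, the density coefficient in the simple model is clearly negative, while in the adjusted model its standard error grows and the estimate is no longer distinguishable from zero in most draws, which is the pattern the plots in this section suggest for the real data.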

Interesting! It seems that for wines with high alcohol content, having a higher concentration of sulphates produces better wines.

The reverse seems to be true for volatile acidity. Having less acetic acid at higher alcohol concentrations seems to produce better wines.

Low pH and high alcohol concentration seem to be a good match.

Acid exploration

Using multivariate plots we should be able to investigate further the relationship between the acids and quality.

There is almost no variance on the y axis compared to the x axis. Let’s try the other acids.

High citric acid and low acetic acid seem like a good combination.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$citric.acid and wine$fixed.acidity
## t = 36.2341, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6438839 0.6977493
## sample estimates:
##       cor 
## 0.6717034

Although there seems to be a correlation between tartaric acid and citric acid concentrations, nothing stands out in terms of quality.

Linear model

Now I’m going to use the most prominent variables to generate some linear models and compare them.

## 
## Calls:
## m1: lm(formula = as.numeric(quality) ~ alcohol, data = training_data)
## m2: lm(formula = as.numeric(quality) ~ alcohol + sulphates, data = training_data)
## m3: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity, 
##     data = training_data)
## m4: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity + 
##     citric.acid, data = training_data)
## m5: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity + 
##     citric.acid + fixed.acidity, data = training_data)
## m6: lm(formula = as.numeric(quality) ~ alcohol + sulphates + pH, 
##     data = training_data)
## 
## =============================================================================
##                      m1        m2        m3        m4        m5        m6    
## -----------------------------------------------------------------------------
## (Intercept)       -0.066    -0.604**   0.605*    0.670**   0.294     1.328*  
##                   (0.220)   (0.224)   (0.248)   (0.257)   (0.289)   (0.516)  
## alcohol            0.357***  0.339***  0.306***  0.305***  0.315***  0.362***
##                   (0.021)   (0.020)   (0.020)   (0.020)   (0.020)   (0.021)  
## sulphates                    1.099***  0.745***  0.770***  0.780***  0.980***
##                             (0.138)   (0.137)   (0.139)   (0.138)   (0.139)  
## volatile.acidity                      -1.199*** -1.272*** -1.333***          
##                                       (0.125)   (0.146)   (0.147)            
## citric.acid                                     -0.128    -0.436*            
##                                                 (0.130)   (0.170)            
## fixed.acidity                                              0.047**           
##                                                           (0.017)            
## pH                                                                  -0.631***
##                                                                     (0.152)  
## -----------------------------------------------------------------------------
## R-squared             0.232    0.280     0.343     0.344     0.349     0.293 
## adj. R-squared        0.231    0.279     0.341     0.341     0.346     0.291 
## sigma                 0.704    0.682     0.651     0.651     0.649     0.676 
## F                   289.048  185.949   166.182   124.873   102.212   131.779 
## p                     0.000    0.000     0.000     0.000     0.000     0.000 
## Log-likelihood    -1022.548 -991.540  -947.687  -947.203  -943.227  -983.004 
## Deviance            473.685  444.023   405.216   404.808   401.465   436.188 
## AIC                2051.096 1991.080  1905.374  1906.407  1900.454  1976.008 
## BIC                2065.693 2010.544  1929.704  1935.602  1934.516  2000.337 
## N                   959      959       959       959       959       959     
## =============================================================================

Notice that I did not include pH in the same formula as the acids, to avoid collinearity problems.
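The models above are fitted on `training_data` with N = 959, which is about 60% of the 1599 wines. How the split was made is not shown, so the following is only a sketch under that assumption: a random 60/40 split, nested models fitted on the training part, and a comparison of adjusted R-squared with held-out error. The data frame `toy` and its values are made up.

```r
set.seed(123)  # made-up data; variable names mirror the report, values do not
n <- 500
toy <- data.frame(alcohol = rnorm(n, 10.4, 1), sulphates = rnorm(n, 0.66, 0.17))
toy$quality <- 2 + 0.35 * toy$alcohol + 1 * toy$sulphates + rnorm(n, sd = 0.7)

# Assumed 60/40 split; the report's N of 959 is about 60% of 1599
train_idx <- sample(seq_len(n), size = floor(0.6 * n))
training_data <- toy[train_idx, ]
test_data <- toy[-train_idx, ]

m1 <- lm(quality ~ alcohol, data = training_data)
m2 <- lm(quality ~ alcohol + sulphates, data = training_data)

# Compare fit on the training set and error on the held-out set
adj_r2 <- c(m1 = summary(m1)$adj.r.squared, m2 = summary(m2)$adj.r.squared)
rmse <- sapply(list(m1 = m1, m2 = m2), function(m) {
  sqrt(mean((test_data$quality - predict(m, newdata = test_data))^2))
})
adj_r2
rmse
```

Looking at held-out error alongside adjusted R-squared helps separate a real gain from a predictor, such as sulphates between m1 and m2, from a marginal one, such as citric.acid between m3 and m4 in the table above.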

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

High alcohol contents and high sulphate concentrations combined seem to produce better wines.

Were there any interesting or surprising interactions between features?

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, I created several models. The most prominent of them was composed of the variables alcohol, sulphates, and the acid variables. There are two problems with it. First, the low R-squared score suggests that there is missing information needed to properly predict quality. Second, both the residuals plot and the cross-validation favor average wines. This is probably a reflection of the high number of average wines in the training dataset, or it could mean that there is missing information that would help predict the edge cases. I hope that the next course in the nanodegree will help me generate better models :)


Final Plots and Summary

Plot One

Description One

This is a very strange distribution. It does not match what we would expect from a variable collected in an experimental situation.

Plot Two

Description Two

High alcohol contents and high sulphate concentrations combined seem to produce better wines.

Plot Three

Description Three

The linear model with the highest R-squared value could explain only around 35% of the variance in quality. Also, the clear correlation shown by the residual plot earlier seems to reinforce that there is missing information needed to better predict both poor and excellent wines.


Reflection

The wine data set contains information on the chemical properties of a selection of wines collected in 2009. It also includes sensory data (wine ranking).

I started by looking at the individual distributions of the variables, trying to get a feel for each one.

The first thing I noticed was the high concentration of wines in the middle ranges of the ranking, that is, average-tasting wines. This proved to be very problematic during the analysis, as I kept asking myself whether there was a true correlation between two variables or whether it was just a coincidence given the lack of “outlier” (poor and excellent) wines.

Out of the chemical variables, the only one that stood out was the concentration of citric acid (variable name citric.acid). The first thing I noticed was the high number of wines that had no citric.acid at all. My initial thought was a data collection error, but upon researching the subject, I found out that citric acid is sometimes added to wines to boost overall acidity, so it makes sense that some wines would have none. Nonetheless, this variable also showed a strange distribution, with some peaks but an almost rectangular shape, especially in the 0-0.5 range.

All of the other variables showed either a normal or a long-tailed distribution.

After exploring the individual variables, I proceeded to investigate the relationships between each input variable and the outcome variable quality.

The most promising variables were alcohol concentration, sulphates and the individual acid concentrations.

I also tried investigating the effect of each acid in the overall pH for the wine. I used scatterplots to explore the relationships graphically and also generated a linear model to check how much of pH the three variables accounted for.

The first surprise here was finding that the correlation between acetic acid concentration and pH was positive. I immediately suspected this was the result of some lurking variable (Simpson’s paradox) and with the help of the “Simpsons” package I confirmed that suspicion.

The second finding was that the concentrations of the three acids account for less than half of the variance in pH. I interpreted this as a sign that there are more components affecting acidity that were not measured.

On the final part of the analysis I tried using multivariate plots to investigate if there were interesting combinations of variables that might affect quality. I also used a multivariate plot to confirm that density did not have an effect on quality when holding alcohol concentration constant.

In the end, the model I produced could not explain much of the variance in quality. This is further corroborated by the acidity analysis.

For future studies, it would be interesting to measure more acid types in the analysis. Wikipedia, for example, suggests that malic and lactic acid are important in wine taste, and these were not included in this sample.

Also, I think it would be interesting to include each wine critic’s judgement as a separate entry in the dataset. After all, each individual has a different taste and is subject to prejudice and other distorting factors. I believe that having this extra information would add more value to the analysis.